This is me working through relevant exercises in R for Data Science by Garrett Grolemund and Hadley Wickham.
filter()
Load packages, data.
library(nycflights13)
library(tidyverse)
Select flights that happened on January 1, 2013.
jan1 <- filter(flights, month == 1, day == 1)
jan1
# alternatively, enclose in parens
(jan1 <- filter(flights, month == 1, day == 1))
When dealing with floating point numbers (decimals), use near() rather than == for logical statements.
sqrt(2)^2 == 2
## [1] FALSE
near(sqrt(2)^2, 2)
## [1] TRUE
Instead of writing out long disjunctions, use %in%.
nov_dec <- filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
DeMorgan’s Law:
!(x & y) == (!x | !y)
!(x | y) == (!x & !y)
If you want to include rows that are NA, you have to do so explicitly (for filter()).
df <- tibble(x = c(1, NA, 3))
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
## x
## <dbl>
## 1 NA
## 2 3
delay <- filter(flights, arr_delay >=120)
summary(delay$arr_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 120.0 137.0 163.0 184.2 207.0 1272.0
houston <- filter(flights, dest %in% c("IAH", "HOU"))
table(houston$dest)
##
## HOU IAH
## 2115 7198
airlines_filt <- filter(flights, carrier %in% c("UA", "AA", "DL"))
table(airlines_filt$carrier)
##
## AA DL UA
## 32729 48110 58665
summer <- filter(flights, month %in% c(7,8,9))
table(summer$month)
##
## 7 8 9
## 29425 29327 27574
arrdelay <- filter(flights, dep_delay <=0 & arr_delay > 120)
arrdelay
## # A tibble: 29 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 27 1419 1420 -1 1754
## 2 2013 10 7 1350 1350 0 1736
## 3 2013 10 7 1357 1359 -2 1858
## 4 2013 10 16 657 700 -3 1258
## 5 2013 11 1 658 700 -2 1329
## 6 2013 3 18 1844 1847 -3 39
## 7 2013 4 17 1635 1640 -5 2049
## 8 2013 4 18 558 600 -2 1149
## 9 2013 4 18 655 700 -5 1213
## 10 2013 5 22 1827 1830 -3 2217
## # ... with 19 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
makeup <- filter(flights, dep_delay >= 60 & (arr_delay < dep_delay - 30))
morning <- filter(flights, dep_time >= 0 & dep_time <= 600)
summary(morning$dep_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 540.0 554.0 488.7 557.0 600.0
between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
between() allows you to select a range for a value (e.g., between 7 and 9). It is inclusive on both ends.
summer <- filter(flights, between(month, 7, 9))
morning <- filter(flights, between(dep_time, 0, 600))
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
missing_dep <- filter(flights, is.na(dep_time))
missing_dep
## # A tibble: 8,255 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## 7 2013 1 2 NA 1355 NA NA
## 8 2013 1 2 NA 1420 NA NA
## 9 2013 1 2 NA 1321 NA NA
## 10 2013 1 2 NA 1545 NA NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
These flights (n = 8255) are missing actual departure and arrival times, as well as air time and sometimes tail number. These are probably flights that were cancelled.
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
Most of those expressions don’t need to know the missing value to determine the result: anything to the zeroth power is 1, anything OR TRUE is TRUE, and anything AND FALSE is FALSE. As for NA * 0: if the NA were Inf, Inf * 0 would be NaN rather than 0, so the result has to stay missing. The general lesson is to check how a function handles NA before using it on missing data.
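A quick console check of these rules (base R only):

```r
NA ^ 0     # 1: any number to the zeroth power is 1, so the NA doesn't matter
NA | TRUE  # TRUE: anything OR TRUE is TRUE
FALSE & NA # FALSE: anything AND FALSE is FALSE
NA * 0     # NA: if the missing value were Inf, Inf * 0 would be NaN, so R can't assume 0
```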
arrange()
Adding multiple columns is like saying THEN: arrange by year THEN month THEN day.
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
desc() does descending order.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Missing values are sorted to the end (the same way the RStudio viewer displays them).
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 2
## 2 5
## 3 NA
How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
df <- tibble(x = c(5, 2, NA))
arrange(df, desc(is.na(x)))
## # A tibble: 3 x 1
## x
## <dbl>
## 1 NA
## 2 5
## 3 2
arrange(flights, desc(dep_delay), dep_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights, air_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 16 1355 1315 40 1442
## 2 2013 4 13 537 527 10 622
## 3 2013 12 6 922 851 31 1021
## 4 2013 2 3 2153 2129 24 2247
## 5 2013 2 5 1303 1315 -12 1342
## 6 2013 2 12 2123 2130 -7 2211
## 7 2013 3 2 1450 1500 -10 1547
## 8 2013 3 8 2026 1935 51 2131
## 9 2013 3 18 1456 1329 87 1533
## 10 2013 3 19 2226 2145 41 2305
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Woooo Newark to Bradley – 20 minutes!
arrange(flights, distance)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 1 3 2127 2129 -2 2222
## 3 2013 1 4 1240 1200 40 1333
## 4 2013 1 4 1829 1615 134 1937
## 5 2013 1 4 2128 2129 -1 2218
## 6 2013 1 5 1155 1200 -5 1241
## 7 2013 1 6 2125 2129 -4 2224
## 8 2013 1 7 2124 2129 -5 2212
## 9 2013 1 8 2127 2130 -3 2304
## 10 2013 1 9 2126 2129 -3 2217
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Lol, there’s a cancelled flight between Newark and LaGuardia. Otherwise it’s Newark to Philly as the shortest.
arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 857 900 -3 1516
## 2 2013 1 2 909 900 9 1525
## 3 2013 1 3 914 900 14 1504
## 4 2013 1 4 900 900 0 1516
## 5 2013 1 5 858 900 -2 1519
## 6 2013 1 6 1019 900 79 1558
## 7 2013 1 7 1042 900 102 1620
## 8 2013 1 8 901 900 1 1504
## 9 2013 1 9 641 900 1301 1242
## 10 2013 1 10 859 900 -1 1449
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Looks like it’s a flight between JFK and Honolulu, which is almost 5,000 miles. It’s only about 9 hours!
select()
select() lets you pick out columns.
# Select columns by name
select(flights, year, month, day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 -1 1004 1022 -18
## 5 554 600 -6 812 837 -25
## 6 554 558 -4 740 728 12
## 7 555 600 -5 913 854 19
## 8 557 600 -3 709 723 -14
## 9 557 600 -3 838 846 -8
## 10 558 600 -2 753 745 8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Helper functions:
starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters.
num_range("x", 1:3): matches x1, x2 and x3.
everything(): selects columns you haven’t mentioned yet.
# moves time_hour and air_time to the front of the tibble
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time
## <dttm> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600
## 8 2013-01-01 06:00:00 53 2013 1 1 557 600
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>
rename() is useful to rename columns.
rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
select(flights, c(4,6,7,9))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
select(flights, year, year, day, year)
## # A tibble: 336,776 x 2
## year day
## <int> <int>
## 1 2013 1
## 2 2013 1
## 3 2013 1
## 4 2013 1
## 5 2013 1
## 6 2013 1
## 7 2013 1
## 8 2013 1
## 9 2013 1
## 10 2013 1
## # ... with 336,766 more rows
select() only returns each column once, no matter how many times it is named.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ... with 336,766 more rows
It’s kind of like %in% for select(): if a column matches one of the names in the vector, it is returned.
select(flights, contains("TIME"))
## # A tibble: 336,776 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ... with 336,766 more rows, and 1 more variable: time_hour <dttm>
It looks like case doesn’t matter for the select() helpers by default. You can change that with ignore.case = FALSE.
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 x 0
mutate()
mutate() makes new variables and adds them to the end of the tibble. You can refer to these new variables to create even more variables within the same call.
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 11
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 -9 370.
## 2 2013 1 1 4 20 1416 227 -16 374.
## 3 2013 1 1 2 33 1089 160 -31 408.
## 4 2013 1 1 -1 -18 1576 183 17 517.
## 5 2013 1 1 -6 -25 762 116 19 394.
## 6 2013 1 1 -4 12 719 150 -16 288.
## 7 2013 1 1 -5 19 1065 158 -24 404.
## 8 2013 1 1 -3 -14 229 53 11 259.
## 9 2013 1 1 -3 -8 944 140 5 405.
## 10 2013 1 1 -2 8 733 138 -10 319.
## # ... with 336,766 more rows, and 2 more variables: hours <dbl>,
## # gain_per_hour <dbl>
If you only want to keep the new variables, use transmute().
transmute(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 4
## gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl>
## 1 -9 370. 3.78 -2.38
## 2 -16 374. 3.78 -4.23
## 3 -31 408. 2.67 -11.6
## 4 17 517. 3.05 5.57
## 5 19 394. 1.93 9.83
## 6 -16 288. 2.5 -6.4
## 7 -24 404. 2.63 -9.11
## 8 11 259. 0.883 12.5
## 9 5 405. 2.33 2.14
## 10 -10 319. 2.3 -4.35
## # ... with 336,766 more rows
You can use any function with mutate() as long as it is vectorised: it must take a vector of values as input and return a vector with the same number of values as output.
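For example, ifelse() is vectorised while a bare if is not, so only the former works inside mutate(). A small sketch on a toy tibble (the column names here are made up for illustration):

```r
library(tidyverse)

df <- tibble(x = c(-2, 0, 3))

# ifelse() returns one value per element, so mutate() accepts it
mutate(df, sign_label = ifelse(x < 0, "negative", "non-negative"))

# a bare `if (x < 0) ...` only makes sense for a single value and fails here
```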
Useful functions:
%/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y).
lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
Cumulative aggregates: cumsum(), cumprod(), cummin(), cummax(); dplyr also provides cummean() for cumulative means.
Ranking: min_rank(). If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile().
dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
flights_sml <- select(flights, contains("dep"))
mutate(flights_sml,
dep_min_since_midn = (dep_time %/% 100) * 60 + (dep_time %% 100),
sched_dep_min_since_midn = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100))
## # A tibble: 336,776 x 5
## dep_time sched_dep_time dep_delay dep_min_since_midn sched_dep_min_sin…
## <int> <int> <dbl> <dbl> <dbl>
## 1 517 515 2 317 315
## 2 533 529 4 333 329
## 3 542 540 2 342 340
## 4 544 545 -1 344 345
## 5 554 600 -6 354 360
## 6 554 558 -4 354 358
## 7 555 600 -5 355 360
## 8 557 600 -3 357 360
## 9 557 600 -3 357 360
## 10 558 600 -2 358 360
## # ... with 336,766 more rows
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
mutate(select(flights, arr_time, dep_time, air_time),
tot_time = arr_time - dep_time)
## # A tibble: 336,776 x 4
## arr_time dep_time air_time tot_time
## <int> <int> <dbl> <int>
## 1 830 517 227 313
## 2 850 533 227 317
## 3 923 542 160 381
## 4 1004 544 183 460
## 5 812 554 116 258
## 6 740 554 150 186
## 7 913 555 158 358
## 8 709 557 53 152
## 9 838 557 140 281
## 10 753 558 138 195
## # ... with 336,766 more rows
I think I have to change arr_time and dep_time to continuous values.
mutate(select(flights, dep_time, arr_time, air_time),
dep_cont = (dep_time %/% 100)* 60 + (dep_time %% 100),
arr_cont = (arr_time %/% 100) * 60 + (arr_time %% 100),
tot_time = arr_cont - dep_cont)
## # A tibble: 336,776 x 6
## dep_time arr_time air_time dep_cont arr_cont tot_time
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 517 830 227 317 510 193
## 2 533 850 227 333 530 197
## 3 542 923 160 342 563 221
## 4 544 1004 183 344 604 260
## 5 554 812 116 354 492 138
## 6 554 740 150 354 460 106
## 7 555 913 158 355 553 198
## 8 557 709 53 357 429 72
## 9 557 838 140 357 518 161
## 10 558 753 138 358 473 115
## # ... with 336,766 more rows
This does not deal with flights that take off before midnight one night and land the next day. I would probably have to filter for those rows and do something different (maybe using lubridate?).
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
dep <- select(flights, dep_time, sched_dep_time, dep_delay)
mutate(dep,
dep_cont = (dep_time %/% 100)* 60 + (dep_time %% 100),
sched_dep_cont = (sched_dep_time %/% 100)* 60 + (sched_dep_time %% 100),
calc_dep_delay = dep_cont - sched_dep_cont)
## # A tibble: 336,776 x 6
## dep_time sched_dep_time dep_delay dep_cont sched_dep_cont
## <int> <int> <dbl> <dbl> <dbl>
## 1 517 515 2 317 315
## 2 533 529 4 333 329
## 3 542 540 2 342 340
## 4 544 545 -1 344 345
## 5 554 600 -6 354 360
## 6 554 558 -4 354 358
## 7 555 600 -5 355 360
## 8 557 600 -3 357 360
## 9 557 600 -3 357 360
## 10 558 600 -2 358 360
## # ... with 336,766 more rows, and 1 more variable: calc_dep_delay <dbl>
dep_delay is the difference between the continuous versions of dep_time and sched_dep_time.
delay <- select(flights, arr_delay)
arrange(mutate(delay,
delay_rank = min_rank(arr_delay)), desc(delay_rank))
## # A tibble: 336,776 x 2
## arr_delay delay_rank
## <dbl> <int>
## 1 1272 327346
## 2 1127 327345
## 3 1109 327344
## 4 1007 327343
## 5 989 327342
## 6 931 327341
## 7 915 327340
## 8 895 327339
## 9 878 327338
## 10 875 327337
## # ... with 336,766 more rows
What does 1:3 + 1:10 return? Why?
1:3 + 1:10
## [1] 2 4 6 5 7 9 8 10 12 11
It returns the element-wise sums along with the warning “longer object length is not a multiple of shorter object length”. R recycles the shorter vector, repeating 1:3 until it matches the length of 1:10; because 10 is not a multiple of 3, the recycling is uneven and R warns about it.
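Recycling is silent when the longer length is an exact multiple of the shorter one, which is exactly why the warning above matters:

```r
1:2 + 1:10  # no warning: 1:2 is recycled five times
## [1]  2  4  4  6  6  8  8 10 10 12
```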
R has the following:
cos(x)
sin(x)
tan(x)
acos(x)
asin(x)
atan(x)
atan2(y, x)
# cospi(x), sinpi(x), and tanpi(x), compute cos(pi*x), sin(pi*x), and tan(pi*x)
cospi(x)
sinpi(x)
tanpi(x)
summarise()
summarise() collapses a data frame to a single row. It’s especially useful for grouped summaries.
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
To make this code easier, we can use the pipe %>%.
# by destination, count the number of flights, mean distance, and mean arrival delay.
## only use locations with more than 20 flights that are not HNL.
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
For most aggregation functions, any missing value in the input yields NA in the output. To avoid this, use na.rm = TRUE.
Also, it’s useful to always include a count n() or count of non-missing values sum(!is.na(x)) when looking at aggregated data, to make sure you’re not drawing conclusions on a small amount of data.
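A small illustration of both points, using a toy vector rather than the flights data:

```r
x <- c(5, NA, 12)
mean(x)                # NA: one missing value poisons the mean
mean(x, na.rm = TRUE)  # 8.5
sum(!is.na(x))         # 2 non-missing values behind that mean
```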
Useful summary functions:
mean(x), median(x).
sd(x), IQR(x), mad(x) (the last two are robust to outliers).
min(x), quantile(x, 0.25), max(x).
first(x), nth(x, 2), last(x).
n(), sum(!is.na(x)), n_distinct(x) (number of unique values).
sum(x > 10), mean(y == 0). Here, sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
There’s also a special function for counts, count(), which takes a weighting variable if needed.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
# sums total number of miles a given plane flew
not_cancelled %>%
count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
ungroup() reverses group_by().
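For example, dropping the grouping before a whole-data-frame summary (a minimal sketch, assuming flights is still loaded):

```r
daily <- group_by(flights, year, month, day)

daily %>%
  ungroup() %>%              # no longer grouped by date
  summarise(flights = n())   # a single row counting all flights
```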
For the 50/50 flights (on time half the time, equally late the other half), if we just take the mean of the delay we will get 0, which implies that the flight is basically always on time – but that’s not true. Instead we probably want some measure of the variability in delay time as well as the mean delay time.
not_cancelled %>%
group_by(carrier, flight) %>%
summarise(count = n(),
m_delay = mean(arr_delay, na.rm = TRUE),
sd_delay = sd(arr_delay, na.rm = TRUE)) %>%
filter(count > 25) %>%
ggplot(mapping = aes(x = m_delay, y = sd_delay)) +
geom_point(alpha = 0.5) + theme_bw()
Here we can see that the most consistent flights are also the ones that tend to arrive early. We can do the same thing with dep_delay.
not_cancelled %>%
group_by(carrier, flight) %>%
summarise(count = n(),
m_delay = mean(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE)) %>%
filter(count > 25) %>%
ggplot(mapping = aes(x = m_delay, y = sd_delay)) +
geom_point(alpha = 0.5) + theme_bw()
Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
# number of flights by destination
not_cancelled %>%
group_by(dest) %>%
summarise(count = n())
## # A tibble: 104 x 2
## dest count
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
# total number of miles by plane
not_cancelled %>%
group_by(tailnum) %>%
summarise(tot_mi = sum(distance, na.rm = TRUE))
## # A tibble: 4,037 x 2
## tailnum tot_mi
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
Our definition of cancelled flights, (is.na(dep_delay) | is.na(arr_delay)), is slightly suboptimal. Why? Which is the most important column?
I think in this case the arr_delay column is more important, because flights can be delayed multiple times before they are actually cancelled. dep_time might just be the latest time the flight was expected to take off before it was cancelled. If arr_delay or arr_time is missing, the flight never landed at its destination. This could mean one of a few things: the flight was cancelled outright, or it departed but was diverted somewhere else.
In the case of the second option, we don’t have information about where a diverted flight actually did land, so for our purposes those flights are as good as cancelled. So, it seems like the best option here is to only use flights that have complete data for arr_delay.
flights %>%
group_by(month, day) %>%
summarise(count = n(),
cancelled = sum(is.na(arr_delay)),
mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = cancelled, y = mean_delay)) +
geom_point(alpha = 0.5) + theme_bw() + geom_smooth(se = FALSE)
Intuitively, it would make sense that days with worse weather would have both more cancelled flights and more delayed flights. It seems like it’s a U-shaped curve – likely, once the weather gets so bad, all the flights get cancelled rather than some delayed and some cancelled.
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
flights %>%
group_by(carrier, dest) %>%
summarise(count = n(),
mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_delay))
## # A tibble: 314 x 4
## # Groups: carrier [16]
## carrier dest count mean_delay
## <chr> <chr> <int> <dbl>
## 1 UA STL 2 110
## 2 OO ORD 1 107
## 3 OO DTW 2 68.5
## 4 UA RDU 1 56
## 5 EV CAE 113 42.8
## 6 EV TYS 323 41.2
## 7 EV PBI 6 40.7
## 8 EV TUL 315 33.7
## 9 EV OKC 346 30.6
## 10 UA JAC 23 29.9
## # ... with 304 more rows
flights %>%
group_by(dest) %>%
summarise(count = n(),
mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_delay))
## # A tibble: 105 x 3
## dest count mean_delay
## <chr> <int> <dbl>
## 1 CAE 116 41.8
## 2 TUL 315 33.7
## 3 OKC 346 30.6
## 4 JAC 25 28.1
## 5 TYS 631 24.1
## 6 MSN 572 20.2
## 7 RIC 2454 20.1
## 8 CAK 864 19.7
## 9 DSM 569 19.0
## 10 GRR 765 18.2
## # ... with 95 more rows
Looks like ExpressJet Airlines Inc. has pretty bad delays at lots of airports – CAE, TYS, PBI, TUL, OKC. United and SkyWest are also not great.
The sort argument to count() sorts the output in descending order of n. This might be helpful if you want to look at the head of a table without using arrange().
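For instance, this puts the busiest destinations at the top without a separate arrange() call (reusing the not_cancelled data frame from above):

```r
not_cancelled %>%
  count(dest, sort = TRUE)  # same as count(dest) %>% arrange(desc(n))
```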
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum) has the worst on-time record?
flights %>%
mutate(canceled = is.na(arr_time),
late = !canceled & arr_delay > 0) %>%
filter(canceled == FALSE) %>%
group_by(tailnum) %>%
summarise(count = n(),
n_delay = sum(late, na.rm = TRUE),
prop_delay = n_delay/count) %>%
mutate(prop_rank = min_rank(desc(prop_delay))) %>%
filter(count > 5) %>%
arrange(prop_rank)
## # A tibble: 3,575 x 5
## tailnum count n_delay prop_delay prop_rank
## <chr> <int> <int> <dbl> <int>
## 1 N337AT 13 12 0.923 105
## 2 N169AT 11 10 0.909 106
## 3 N168AT 18 16 0.889 107
## 4 N261AT 9 8 0.889 107
## 5 N284AT 9 8 0.889 107
## 6 N290AT 16 14 0.875 110
## 7 N551NW 7 6 0.857 111
## 8 N273AT 13 11 0.846 112
## 9 N299AT 6 5 0.833 113
## 10 N388SW 6 5 0.833 113
## # ... with 3,565 more rows
This table summarizes the planes with the highest percentage of delayed flights (excluding cancelled flights). I’ve only included planes that have flown more than five times, since many planes flew only once and happened to be delayed. This shows that plane N337AT was delayed on 12 of its 13 flights in 2013.
flights %>%
group_by(hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = hour, y = avg_delay)) + geom_bar(stat = "identity") + theme_bw()
In this question I’m using dep_delay instead of arr_delay because it seems like you would want to avoid delays at the gate when choosing flight time. It looks like early morning (5am) is the best time – average delay is less than 1 minute! This makes sense, because delays accumulate throughout the day as flights come in late and then leave late.
flights %>%
filter(!is.na(arr_delay), arr_delay > 0) %>%
group_by(dest) %>%
mutate(total_delay = sum(arr_delay),
prop_delay = arr_delay / total_delay) %>%
select(flight, dest, total_delay, prop_delay)
## # A tibble: 133,004 x 4
## # Groups: dest [103]
## flight dest total_delay prop_delay
## <int> <chr> <dbl> <dbl>
## 1 1545 IAH 99391 0.000111
## 2 1714 IAH 99391 0.000201
## 3 1141 MIA 140424 0.000235
## 4 1696 ORD 283046 0.0000424
## 5 507 FLL 202605 0.0000938
## 6 301 ORD 283046 0.0000283
## 7 194 LAX 203226 0.0000344
## 8 707 DFW 110009 0.000282
## 9 4650 ATL 300299 0.0000400
## 10 4401 DTW 138258 0.000116
## # ... with 132,994 more rows
flights %>%
arrange(origin, month, day, dep_time) %>% # arrange in order
group_by(origin) %>% # group by airport to avoid cross-airport comparisons
mutate(prev_delay = lag(dep_delay)) %>% # create lagged delay column
group_by(dep_delay) %>% # otherwise, they'll have the same value -- this preserves lag
summarise(m_prev_delay = mean(prev_delay, na.rm = TRUE)) %>% # find mean delay for previous flight
ggplot(aes(dep_delay, m_prev_delay)) + geom_point() + theme_bw() + xlab("Current Flight Delay") + ylab("Prev Flight Delay")
Here we can see that there is a pretty strong relationship between the previous flight’s delay and the current flight’s delay until we get to about a 5 hour delay, at which point anything goes.
flights %>%
group_by(origin, dest) %>%
mutate(med_air_time = median(air_time, na.rm = TRUE)) %>%
select(month, day, flight, origin, dest, air_time, med_air_time) %>%
mutate(diff = air_time - med_air_time) %>%
arrange(diff)
## # A tibble: 336,776 x 8
## # Groups: origin, dest [224]
## month day flight origin dest air_time med_air_time diff
## <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 7 2 4667 EWR MSP 93 149 -56
## 2 7 14 673 EWR SNA 274 329 -55
## 3 9 6 771 JFK LAX 275 329 -54
## 4 7 3 161 JFK SEA 275 328 -53
## 5 7 12 21 JFK LAX 277 329 -52
## 6 7 13 485 EWR SNA 279 329 -50
## 7 7 15 329 EWR SNA 279 329 -50
## 8 8 27 797 JFK LAX 279 329 -50
## 9 7 15 712 JFK LAX 280 329 -49
## 10 9 28 15 EWR HNL 562 611 -49
## # ... with 336,766 more rows
If we compare the air time for each flight to the median air time for that origin-destination pair, we see that the fastest flights are only about an hour quicker than a normal flight between those airports. This doesn’t necessarily indicate an error in data entry. You can note that most of these quick flights happen in July – maybe when the weather is pretty clear.
flights %>%
group_by(origin, dest) %>%
mutate(med_air_time = median(air_time, na.rm = TRUE)) %>%
select(month, day, flight, origin, dest, air_time, med_air_time) %>%
mutate(diff = air_time - med_air_time) %>%
arrange(desc(diff))
## # A tibble: 336,776 x 8
## # Groups: origin, dest [224]
## month day flight origin dest air_time med_air_time diff
## <int> <int> <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 7 28 841 JFK SFO 490 347 143
## 2 1 28 575 JFK EGE 382 255 127
## 3 11 22 426 JFK LAX 440 329 111
## 4 9 10 745 LGA DEN 331 227 104
## 5 11 22 587 EWR LAS 399 298 101
## 6 6 29 1491 JFK ACK 141 41 100
## 7 1 2 4204 EWR OKC 286 190 96
## 8 1 15 4204 EWR OKC 284 190 94
## 9 7 10 17 JFK LAX 422 329 93
## 10 12 6 434 JFK SFO 438 347 91
## # ... with 336,766 more rows
If we look at the opposite, we can see some flights that were in the air a lot longer than average. For example, the flight on 7/28 from JFK to SFO was in the air over 2 hours more than average.
airports_carriers <- flights %>%
group_by(dest) %>%
summarise(n_carriers = n_distinct(carrier)) %>%
filter(n_carriers > 2) %>%
arrange(n_carriers)
flights %>%
filter(dest %in% airports_carriers$dest) %>%
group_by(carrier) %>%
summarise(n_dest = n_distinct(dest)) %>%
arrange(desc(n_dest))
## # A tibble: 15 x 2
## carrier n_dest
## <chr> <int>
## 1 DL 37
## 2 EV 36
## 3 UA 36
## 4 9E 35
## 5 B6 30
## 6 AA 17
## 7 MQ 17
## 8 WN 9
## 9 OO 5
## 10 US 5
## 11 VX 3
## 12 YV 3
## 13 FL 2
## 14 AS 1
## 15 F9 1
It looks like Delta flies to the most multi-carrier airports, followed by ExpressJet and United.
flights %>%
filter(!is.na(air_time)) %>%
arrange(tailnum, month, day, dep_time) %>%
mutate(one_hour = ifelse(dep_delay > 60, 1,0)) %>%
group_by(tailnum) %>%
mutate(cumsum = cumsum(one_hour)) %>%
filter(cumsum == 0) %>%
summarise(count = n())
## # A tibble: 3,804 x 2
## tailnum count
## <chr> <int>
## 1 N0EGMQ 53
## 2 N10156 9
## 3 N102UW 25
## 4 N103US 46
## 5 N104UW 3
## 6 N105UW 22
## 7 N107US 20
## 8 N108UW 36
## 9 N109UW 48
## 10 N110UW 15
## # ... with 3,794 more rows
Two main questions for EDA:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
Variation is how a variable changes between measurements.
Categorical variables can be visualized using a bar chart.
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) + theme_bw()
Continuous variables can be visualized using a histogram.
ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5) + theme_bw()
To view multiple distributions of continuous variables at once, use geom_freqpoly().
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1) + theme_bw()
Things to look out for:
* Typical values: Why are some values more common than others? Why are some values non-existent in the dataset? Why do some values seem to cluster together?
* Unusual values: What outliers exist? Does your analysis change if you exclude them?
library(gridExtra)
x <- ggplot(diamonds, aes(x)) + geom_histogram() + theme_bw()
y <- ggplot(diamonds, aes(y)) + geom_histogram() + theme_bw()
z <- ggplot(diamonds, aes(z)) + geom_histogram() + theme_bw()
grid.arrange(x,y,z,ncol=1)
The documentation says that x is length, y is width, and z is depth. This suggests that most of the diamonds are not cut in round or square shapes where the length and width would be very similar. More information about the cut shape of the diamonds would help us determine this without the documentation.
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw()
Nothing super weird about this plot right now.
ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 10) + theme_bw()
Making the bin size smaller shows us a weird gap somewhere between 0 and 2500 dollars. Let’s look at that.
lessthanthreek <- filter(diamonds, price < 3000)
ggplot(lessthanthreek, aes(price)) + geom_histogram(binwidth = 10) + theme_bw()
There’s a very specific range of prices around $1500 where there are no cases. That’s weird!
diamonds %>%
filter(carat == 0.99 | carat == 1) %>%
group_by(carat) %>%
summarise(count = n())
## # A tibble: 2 x 2
## carat count
## <dbl> <int>
## 1 0.99 23
## 2 1 1558
There are a lot more diamonds that are 1 carat than those that are 0.99. This is probably because people want to buy a 1-carat diamond (as opposed to a 0.99) – jewelers might be rounding up from 0.99 to 1.
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + coord_cartesian(xlim = c(200, 1000))
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + xlim(200, 1000)
It seems like coord_cartesian is simply zooming in on the larger plot – the data extends past the last tick mark, and there is only one bin (because the 30 total bins come from the entire dataset). However, xlim (and presumably ylim) subsets the data to those that fit the limits defined. So, our data between 200 and 1000 is binned into 30 pieces using xlim.
Sometimes when we have outlier/weird/obviously-an-error data points, we will set those values to NA rather than drop the entire case. However, we have to think about how R is going to deal with missing data after that point.
mis_diamonds <- diamonds %>%
mutate(price_miss = ifelse(price > 750 & price < 1900, NA, price))
ggplot(mis_diamonds, aes(price_miss)) + geom_histogram(binwidth = 10) + theme_bw()
In a histogram, the missing values are just removed.
mis_diamonds <- diamonds %>%
mutate(color_miss = ifelse(color == "H", NA, as.character(color)))
ggplot(mis_diamonds, aes(as.factor(color_miss))) + geom_bar() + xlab("Color") + theme_bw()
In a bar chart, they’re plotted as NA, so you can see how many missing values you have.
What does na.rm = TRUE do in mean() and sum()?
For both of these functions, na.rm = TRUE removes NA values before doing the calculation.
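A quick base-R sketch of this behaviour (toy vector):

```r
x <- c(1, 2, NA, 4)

mean(x)               # NA -- a single NA poisons the result
mean(x, na.rm = TRUE) # about 2.33 -- mean of 1, 2, 4
sum(x, na.rm = TRUE)  # 7
```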
Covariation is the tendency for the values of two or more variables to vary together in a related way. We’ll look at this by checking out some plots.
Boxplots are good for looking at the relationship between one categorical and one continuous variable.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) + theme_bw() + xlab("car type")
For two categorical variables, we can make one of those cool grids.
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
When we are dealing with two continuous variables, we default to geom_point(). But it’s hard to read scatterplots when we have huge datasets. Other ways we can plot are with geom_bin2d() or geom_hex(). We can also bin a continuous variable and look at it in some boxplots.
ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
library(hexbin)
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(cancelled, sched_dep_time)) + geom_boxplot() + theme_bw()
Generally, cancelled flights were scheduled to leave later than non-cancelled flights.
I would expect that the variable most related to price is carat. Let’s look at how it relates to cut.
ggplot(diamonds, aes(cut, carat)) + geom_boxplot() + theme_bw()
It looks like the biggest diamonds tend to have the lowest quality cuts. That is probably why the lower quality diamonds are the most expensive: carat drives price more than cut does.
Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
library(ggstance)
ggplot(diamonds, aes(carat, cut)) + geom_boxploth() + theme_bw()
ggplot(diamonds, aes(cut, carat)) + geom_boxplot() + theme_bw() + coord_flip()
It looks like ggstance allows you to list the aesthetic mappings in the opposite way.
Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
library(lvplot)
ggplot(diamonds, aes(cut, price)) + geom_lv(aes(fill=..LV..)) + theme_bw() + scale_fill_lv()
The LV plot is fine (I guess), but I’d rather just use a violin plot if I want to know something about distributions.
ggplot(diamonds, aes(cut, price)) + geom_violin() + theme_bw()
Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(diamonds, aes(cut, price)) + geom_violin() + theme_bw()
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + facet_wrap(~cut)
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500) + theme_bw()
Violin plots are great if you want to compare distributions side-by-side, as you can get a sense for central tendency as well as spread. I don’t like faceted histograms that much at all (esp. because that’s what a violin plot is, basically). The geom_freqpoly() is great for seeing relative frequencies, which are harder to see in violin plots.
If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
Not going to explain these, but it’s a cool package to point out.
# cut within color
diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop))
# color within cut
diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop))
flights %>%
group_by(dest, month) %>%
summarise(m_delay = mean(arr_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = dest, y = month)) +
geom_tile(mapping = aes(fill = m_delay))
First, there are a lot of destinations with no flights for a given month. We can change that by filtering. It’s also hard to read the continuous scale. I’ll change that by binning it.
flights %>%
group_by(dest) %>%
mutate(n_flights = n()) %>%
filter(n_flights > 15) %>%
group_by(dest, month) %>%
summarise(m_delay = mean(arr_delay, na.rm = TRUE)) %>%
mutate(delay_bin = ifelse(m_delay < 0, "Early",
ifelse(m_delay >= 0 & m_delay < 30, "0-30 min late",
ifelse(m_delay >= 30 & m_delay < 60, "30-60 min late",
ifelse(m_delay >= 60, "60+ min late", NA))))) %>%
mutate(delay_bin = factor(delay_bin, levels = c("Early", "0-30 min late", "30-60 min late", "60+ min late"))) %>%
ggplot(mapping = aes(x = as.factor(month), y = dest)) +
geom_tile(mapping = aes(fill = delay_bin)) + scale_fill_brewer(direction = 1) + xlab("Month")
It would be nice to have a way to filter the relevant destinations as well.
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = cut, y = color)) +
geom_tile(mapping = aes(fill = n))
There are more colors, and it’s easier to read them along the x-axis.
What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
ggplot(data = diamonds,
mapping = aes(color = cut_number(carat, 5), x = price)) +
geom_freqpoly() + theme_bw()
cut_number() makes a certain number of bins (each containing roughly the same number of observations), so you need to make sure they give you the right resolution on your data.
ggplot(data = diamonds,
mapping = aes(color = cut_width(carat, 1), x = price)) +
geom_freqpoly() + theme_bw()
cut_width() makes bins of a certain width, so again, think about the resolution you want on your data.
ggplot(diamonds, aes(x = price, y = carat)) +
geom_boxplot(mapping = aes(group = cut_width(price, 1000))) + theme_bw()
Cheap diamonds have a much narrower carat distribution than very expensive diamonds. I would assume that for small diamonds, there is only so much variation that quality can give – a really great tiny diamond probably costs something similar to a not-so-great tiny diamond.
ggplot(diamonds, aes(x = cut_number(price,5), y = carat, fill = cut)) +
geom_boxplot() + theme_bw()
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
ggplot(data = diamonds) +
geom_hex(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Since most of the data are on the same line, it’s hard to read the binned plot here. We’re looking for individual outlier points, which are hard to see on this type of plot.
General ideas:
Keep everything you do in your analysis in the script. If you run things right in the console, assume you will forget them later.
Instead of using setwd() all the time, use an R project.
I think R projects are just basically directories that are structured in a standard way, where you use relative paths rather than absolute paths.
as_tibble() works like as.data.frame(), and tibble() works like data.frame(), except that they don’t change the type of the input (e.g., string to factor).
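A small sketch of that difference. Note: since R 4.0, data.frame() also defaults to stringsAsFactors = FALSE, so the factor conversion below is forced explicitly to mimic the old default:

```r
library(tibble)

df <- data.frame(x = "a", stringsAsFactors = TRUE)  # old data.frame default behaviour
tb <- tibble(x = "a")                               # tibble preserves the input type

class(df$x)  # "factor"
class(tb$x)  # "character"
```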
You can use tribble() to make creating a data frame easier to read.
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
## # A tibble: 2 x 3
## x y z
## <chr> <dbl> <dbl>
## 1 a 2 3.6
## 2 b 1 8.5
By default, a tibble will only print the first 10 rows of a table and only those columns that fit on the screen. You can modify this if you want. width = Inf will print all the columns.
nycflights13::flights %>%
print(n = 10, width = Inf)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
## 1 819 11 UA 1545 N14228 EWR IAH 227
## 2 830 20 UA 1714 N24211 LGA IAH 227
## 3 850 33 AA 1141 N619AA JFK MIA 160
## 4 1022 -18 B6 725 N804JB JFK BQN 183
## 5 837 -25 DL 461 N668DN LGA ATL 116
## 6 728 12 UA 1696 N39463 EWR ORD 150
## 7 854 19 B6 507 N516JB EWR FLL 158
## 8 723 -14 EV 5708 N829AS LGA IAD 53
## 9 846 -8 B6 79 N593JB JFK MCO 140
## 10 745 8 AA 301 N3ALAA LGA ORD 138
## distance hour minute time_hour
## <dbl> <dbl> <dbl> <dttm>
## 1 1400 5 15 2013-01-01 05:00:00
## 2 1416 5 29 2013-01-01 05:00:00
## 3 1089 5 40 2013-01-01 05:00:00
## 4 1576 5 45 2013-01-01 05:00:00
## 5 762 6 0 2013-01-01 06:00:00
## 6 719 5 58 2013-01-01 05:00:00
## 7 1065 6 0 2013-01-01 06:00:00
## 8 229 6 0 2013-01-01 06:00:00
## 9 944 6 0 2013-01-01 06:00:00
## 10 733 6 0 2013-01-01 06:00:00
## # ... with 3.368e+05 more rows
How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame.)
It will print differently – in particular, tibbles will say A tibble at the top.
Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df_tibble <- tibble(abc = 1, xyz = "a")
df$x
## [1] a
## Levels: a
df_tibble$x
## NULL
In this one, the data.frame returned the most similar column (xyz), while the tibble returned NULL (with a warning). Tibbles don’t do partial matching. A partial match could lead to errors – e.g., if you think you are calling column x but actually it was never created, so instead you are calling xyz.
class(df[, "xyz"])
## [1] "factor"
class(df_tibble[, "xyz"])
## [1] "tbl_df" "tbl" "data.frame"
Here, the data.frame returns a vector of a factor with one level (a), while the tibble returns a character tibble.
df[, c("abc", "xyz")]
## abc xyz
## 1 1 a
df_tibble[, c("abc", "xyz")]
## # A tibble: 1 x 2
## abc xyz
## <dbl> <chr>
## 1 1 a
In this case they return the same thing.
var <- "mpg", how can you extract the reference variable from a tibble?var <- "mpg"
mtcars %>% .[[var]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
annoying$`1`
## [1] 1 2 3 4 5 6 7 8 9 10
annoying %>%
ggplot(aes(`1`, `2`)) + geom_point()
annoying <- annoying %>%
mutate(`3` = `2`/`1`)
names(annoying) <- c("one", "two", "three")
annoying
## # A tibble: 10 x 3
## one two three
## <int> <dbl> <dbl>
## 1 1 0.714 0.714
## 2 2 3.67 1.83
## 3 3 5.89 1.96
## 4 4 6.98 1.74
## 5 5 9.64 1.93
## 6 6 12.7 2.11
## 7 7 14.4 2.06
## 8 8 16.4 2.04
## 9 9 18.5 2.06
## 10 10 18.7 1.87
What does tibble::enframe() do? When might you use it?
enframe() takes a named vector and turns it into a two-column data frame. It is useful for dealing with a lot of the output from stats methods.
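For example, with a made-up named vector (the kind of thing table() or coef() might hand you):

```r
library(tibble)

scores <- c(alice = 90, bob = 85)  # hypothetical named vector
enframe(scores)                    # a 2 x 2 tibble with columns name and value
```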
You can use n_extra in print().
read_csv()
skip = n lets you skip a few lines at the top of the file (good for E-Prime exports); comment = "#" will skip any lines that start with #.
readr vs. base R
Reasons why read_csv() from readr is better than base R’s read.csv(): it is much faster, it produces tibbles, it doesn’t convert character columns to factors or munge column names, and it behaves more consistently across operating systems.
The parse_*() functions take a character vector to parse, plus an na argument giving the strings to treat as NA.
To parse numbers, you can give arguments to locale – e.g., changing from 1.00 to 1,00.
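A minimal readr sketch of both ideas (the strings here are made up):

```r
library(readr)

# na argument: which strings count as missing
parse_integer(c("1", ".", "5"), na = ".")  # 1 NA 5

# locale argument: European-style numbers
parse_double("1,23", locale = locale(decimal_mark = ","))  # 1.23
parse_number("1.234,56",
             locale = locale(grouping_mark = ".", decimal_mark = ","))  # 1234.56
```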
Sometimes it’s useful to parse factors as character vectors until you’ve finished cleaning them up.
### Exercises
1. What function would you use to read a file where fields were separated with |?
read_delim(path_to_file, delim = "|")
## Error in is.connection(x): object 'path_to_file' not found
Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
col_names, col_types, locale, na, quoted_na, quote, trim_ws, skip, n_max, guess_max, progress.
Probably file and col_positions. col_positions says how wide each of the fixed-width columns are.
" or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?read_csv("x,y\n1,'a,b'", quote = "'")
## # A tibble: 1 x 2
## x y
## <int> <chr>
## 1 1 a,b
read_csv("a,b\n1,2,3\n4,5,6")
## # A tibble: 2 x 2
## a b
## <int> <int>
## 1 1 2
## 2 4 5
read_csv("a,b,c\n1,2\n1,2,3,4")
## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 NA
## 2 1 2 3
These ones have rows where the number of values doesn’t match the number of columns.
read_csv("a,b\n\"1")
## # A tibble: 1 x 2
## a b
## <int> <chr>
## 1 1 <NA>
This one has an unclosed quote, so b gets no value.
read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
This one parses without error, but the a,b row forces both columns to be read as character.
read_csv("a;b\n1;3")
## # A tibble: 1 x 1
## `a;b`
## <chr>
## 1 1;3
This one needs to have the separator changed to ;.
What are the most important arguments to locale()?
The most important argument is probably date_names, where you can just set it to something like en or fr.
What happens if you try and set decimal_mark and grouping_mark to the same character?
parse_number("123,456,789", locale = locale(grouping_mark = ",", decimal_mark = ","))
## Error: `decimal_mark` and `grouping_mark` must be different
You get an error.
What happens to the default value of grouping_mark when you set decimal_mark to ,?
parse_number("123,456,789", locale = locale(decimal_mark = ","))
## [1] 123.456
It changes to ..
What happens to the default value of decimal_mark when you set the grouping_mark to .?
parse_number("123,456,789", locale = locale(decimal_mark = "."))
## [1] 123456789
It changes to ,.
I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
They let you set the date or time format manually. E.g.,
str(parse_guess("01/02/2013", locale = locale(date_format = "%d/%m/%Y")))
## Date[1:1], format: "2013-02-01"
I live inside the US!
What’s the difference between read_csv() and read_csv2()?
read_csv() uses , for the separator and . for decimals. read_csv2() uses ; for the separator and , for the decimal mark.
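A quick sketch with inline strings:

```r
library(readr)

read_csv("x,y\n1.5,2")   # comma separator, period decimal
read_csv2("x;y\n1,5;2")  # semicolon separator, comma decimal -- x is still 1.5
```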
The ISO 8859 encodings are commonly used in Europe (e.g., for Greek, Polish, and Slovenian), as are Windows-1251 and Windows-1254. Chinese commonly uses Big5 and GB18030, Korean uses EUC-KR, and Japanese uses Shift_JIS.
d1 <- "January 1, 2010"
parse_date(d1, "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, "%I:%M:%OS %p")
## 23:15:10.12
If your columns are not coming in as the right type, you can tell readr how to parse them manually.
If you want to save interim results, you can use write_rds and read_rds to preserve column types.
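A minimal sketch of both ideas (the column names are made up, and the interim file goes to a temp path):

```r
library(readr)

# tell readr how to parse each column explicitly
df <- read_csv("x,y\n1,a",
               col_types = cols(x = col_double(), y = col_character()))

# write_rds()/read_rds() round-trip the object, preserving column types
path <- tempfile(fileext = ".rds")
write_rds(df, path)
df2 <- read_rds(path)
```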
Tidy data follows a particular structure: each variable gets its own column, each observation gets its own row, and each value gets its own cell.
Missing data can exist explicitly (as a value of NA) or implicitly (by just not being present).
complete() takes all of the combinations of a set of columns (like subject and condition) and makes one row for each combination, filling in empty cells with NA.
fill() will replace missing cells with the most recent non-missing value (based on row index).
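A toy sketch of both functions (made-up subject/condition data):

```r
library(tidyr)
library(tibble)

df <- tibble(
  subject   = c(1, 1, 2),
  condition = c("a", "b", "a"),
  score     = c(10, 12, 9)
)
complete(df, subject, condition)  # adds the missing (2, "b") row with score = NA

df2 <- tibble(person = c("Ann", NA, NA), height = c(160, 170, 155))
fill(df2, person)                 # carries "Ann" down into the NA cells
```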
table1: Each row has cases and population for a single year in a single country. table2: This is a long dataset (good for plotting, sometimes). Each row has some observation (either cases or population) for a single country in a single year. table3: Here the numeric data (rate) is stored as a character vector! This is pretty obnoxious. table4a & table4b: Each table has one row per country showing values for something with separate columns per year.
Compute the rate for table2, and table4a + table4b. You will need to perform four operations:
table2 %>%
spread(key = type, value = count) %>%
mutate(rate = cases/population * 10000)
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
table4a_2 <- table4a %>%
gather(key = year, value = cases, `1999`, `2000`)
table4b_2 <- table4b %>%
gather(key = year, value = population, `1999`, `2000`)
table4 <- merge(table4a_2, table4b_2)
table4 %>%
mutate(rate = cases/population * 10000)
## country year cases population rate
## 1 Afghanistan 1999 745 19987071 0.372741
## 2 Afghanistan 2000 2666 20595360 1.294466
## 3 Brazil 1999 37737 172006362 2.193930
## 4 Brazil 2000 80488 174504898 4.612363
## 5 China 1999 212258 1272915272 1.667495
## 6 China 2000 213766 1280428583 1.669488
Definitely table2 is easier because you can just reshape it into wide format easily. For the two separate tables, you have to 1) know what data they show (population vs. cases) 2) reshape them 3) merge them, all before creating the rate variable.
Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
table2 %>%
spread(key = type, value = count) %>%
ggplot(aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
You have to use spread() to get the right data format.
Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
## half year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 1 2016 0.92
## 4 2 2016 0.17
(Hint: look at the variable types and think about column names.)
The year variable is originally numeric, but then when it becomes a column name through spread() and then a value again through gather(), it ends up as a character variable. These functions do not preserve variable type.
Both spread() and gather() have a convert argument. What does it do?
The convert argument deals with the issue above, converting character column names back into numeric/logical/integer values where appropriate.
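Re-running the stocks round trip with convert = TRUE (redefined here so the snippet stands alone):

```r
library(tidyr)
library(tibble)
library(dplyr)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

stocks %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`, convert = TRUE)
# year comes back as an integer rather than a character vector
```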
table4a %>%
gather(1999, 2000, key = "year", value = "cases")
## Error in inds_combine(.vars, ind_list): Position must be between 0 and n
When you use numbers for column names, you have to surround them with backticks or else they will not work.
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
There are two rows that have the same name and key (Phillip Woods, age), so spreading this tibble fails. You could add a unique id column to fix it.
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
The male and female columns should be gathered into a single gender variable.
preg %>%
gather(key = gender, value = number, male, female)
## # A tibble: 4 x 3
## pregnant gender number
## <chr> <chr> <dbl>
## 1 yes male NA
## 2 no male 20
## 3 yes female 10
## 4 no female 12
The variables are pregnant, gender, and number.
What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
extra lets you choose what to do with extra pieces (e.g., in row 2). The default is to drop with a warning.
fill lets you choose what to do when there are not enough pieces. By default it warns and fills with NA from the right.
Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?
remove lets you choose whether to keep or drop the original input column(s). You would set it to FALSE if you want to keep the original column in the data frame.
Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
separate() splits a single column into multiple columns based on a separator or position. extract() uses a regular expression to find capturing groups and put them into columns, so it lets you look for something different in each section of the column/string.
There are lots of ways to split a string, but only one way to paste it back together!
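A small round trip illustrating this (toy date columns):

```r
library(tidyr)
library(tibble)

dates <- tibble(year = 2020, month = 3, day = 14)

# only one way to paste together...
united <- unite(dates, "date", year, month, day, sep = "-")  # "2020-3-14"

# ...but splitting needs you to say how (here: by separator)
separate(united, date, c("year", "month", "day"), sep = "-")
# note the split columns come back as character
```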
Compare and contrast the fill arguments to spread() and complete().
In spread(), fill lets you choose what to replace missing values (explicit and implicit) with.
In complete(), fill also lets you choose what to use for missing combinations (explicit and implicit), but you can set it for each variable.
What does the direction argument to fill() do?
It lets you choose whether to copy entries up the data frame or down the data frame.
In this case study I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?
table(who$new_ep_f5564)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 253 90 53 40 31 24 32 21 12 19 17 12 10 12 4
## 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30
## 7 8 5 7 6 5 4 2 5 3 4 4 8 6 7
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 3 5 5 2 7 5 4 3 4 10 3 4 8 2 1
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 11 1 2 2 5 2 2 5 2 3 3 5 2 3 2
## 61 62 63 64 65 69 70 73 74 75 76 78 80 82 83
## 3 2 2 2 3 1 1 1 1 2 1 2 2 3 2
## 84 87 91 92 93 95 101 102 106 107 108 109 110 111 112
## 2 1 1 3 1 1 1 1 2 1 1 1 1 1 2
## 113 115 117 118 120 121 123 124 125 126 127 128 129 131 132
## 2 1 2 2 1 2 1 2 1 1 1 1 2 1 2
## 134 137 138 139 140 141 142 143 144 145 146 148 149 150 151
## 2 1 1 1 1 1 3 1 1 1 1 1 1 1 1
## 152 155 156 161 162 163 164 170 172 173 174 176 181 182 183
## 3 2 2 1 2 1 1 1 1 2 2 1 1 1 1
## 186 187 188 192 195 200 207 209 231 238 244 248 253 256 266
## 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
## 271 272 283 287 301 302 307 317 339 348 357 360 361 374 379
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 380 384 385 386 387 388 393 396 397 399 409 415 420 421 435
## 1 1 1 1 1 1 1 1 2 1 1 2 3 1 2
## 442 463 475 483 493 513 520 563 571 578 582 584 614 620 628
## 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## 631 702 849 901 969 992 1118 1186 1498 1726 1894 4684
## 1 1 1 1 1 1 1 1 1 1 1 1
table(is.na(who$new_ep_f5564))
##
## FALSE TRUE
## 1017 6223
In this dataset it seems like missing values are truly missing, which suggests the missingness is mostly explicit. table() shows that 0 is an entered value, so an NA is not the same thing as a zero.
What happens if you neglect the mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))
who %>%
gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 6
## country year var sex age value
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # ... with 76,036 more rows
You will have values of key that mean the same thing but are counted as different variables because there are slight variations in how they were named.
I claimed that iso2 and iso3 were redundant with country. Confirm this claim.
who %>%
group_by(country, iso2, iso3) %>%
summarise(freq = n()) %>%
group_by(country) %>%
summarise(freq2 = n()) %>%
filter(freq2 > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: country <chr>, freq2 <int>
No country has more than one combination of country, iso2 and iso3.
who %>%
gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1) %>%
group_by(country, year, sex) %>%
summarise(count = n()) %>%
ggplot(aes(year, count)) + geom_point() + facet_wrap(~sex) + theme_bw()
It doesn’t really make sense to plot all 200+ countries. Perhaps if I had region information I could color-code the points on the plot. Ideally, there would be lines connecting the points from a given region to give a sense of change over time.
Relational data takes place over multiple tables.
Mutating joins: add new variables to one data frame from matching observations in another.
Filtering joins: filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations: treat observations as if they were set elements.
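A toy contrast between mutating and filtering joins (made-up band data, not from nycflights13):

```r
library(dplyr)
library(tibble)

band  <- tibble(name = c("Mick", "John", "Paul"),
                band = c("Stones", "Beatles", "Beatles"))
plays <- tibble(name  = c("John", "Paul", "Keith"),
                plays = c("guitar", "bass", "guitar"))

left_join(band, plays, by = "name")  # mutating: adds the plays column (NA for Mick)
semi_join(band, plays, by = "name")  # filtering: keeps John and Paul, adds no columns
anti_join(band, plays, by = "name")  # filtering: keeps only the non-matching Mick row
```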
You’d need to join airports on flights to get the lat/long of the airports.
I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?
origin in weather matches faa in airports.
weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?
You would be able to join weather to flights by either origin or destination, depending on which data you wanted.
flights %>%
left_join(select(planes, tailnum, seats)) %>%
group_by(month,day) %>%
summarise(n_seats = sum(seats, na.rm = TRUE)) %>%
arrange(n_seats)
## # A tibble: 365 x 3
## # Groups: month [12]
## month day n_seats
## <int> <int> <int>
## 1 2 9 59809
## 2 11 28 74750
## 3 12 14 75186
## 4 11 29 78080
## 5 2 2 78490
## 6 1 26 79846
## 7 1 19 79947
## 8 12 7 80110
## 9 1 12 80170
## 10 9 7 82605
## # ... with 355 more rows
This data frame shows the total number of seats on all flights departing NYC on a given day, with the days offering the fewest seats at the top. The primary key is the combination of month and day, which lets you join it (albeit imperfectly) to other tables.
We could also make a data frame of days that people might not travel often (Christmas, New Year's, etc.), though the data shows that this is not really the case. Still, it might be useful if we want to filter those days (or other holiday travel) out of the other data frames.
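A sketch of that filtering join, using a hypothetical `special_days` table (the dates below are illustrative, not an exhaustive holiday list):

```r
library(dplyr)
library(nycflights13)

# Hypothetical table of special days
special_days <- tribble(
  ~year, ~month, ~day, ~holiday,
  2013,      1,    1,  "New Year's Day",
  2013,     12,   25,  "Christmas Day"
)

# Filtering join: drop all flights departing on those days
flights %>%
  anti_join(special_days, by = c("year", "month", "day"))
```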
flights
flights has no obvious primary key, so we can add a surrogate key with row_number():
flights %>%
mutate(id = row_number())
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, id <int>
library(Lahman)
library(babynames)
library(nasaweather)
library(fueleconomy)
Lahman::Batting
Lahman::Batting %>%
count(playerID, yearID, stint) %>%
filter(n > 1)
## # A tibble: 0 x 4
## # ... with 4 variables: playerID <chr>, yearID <int>, stint <int>, n <int>
Combo of playerID, yearID and stint.
babynames::babynames
babynames::babynames %>%
count(year, sex, name) %>%
filter(nn > 1)
## # A tibble: 0 x 4
## # ... with 4 variables: year <dbl>, sex <chr>, name <chr>, nn <int>
Combo of year, sex, and name.
nasaweather::atmos
nasaweather::atmos %>%
count(lat, long, year, month) %>%
filter(n > 1)
## # A tibble: 0 x 5
## # ... with 5 variables: lat <dbl>, long <dbl>, year <int>, month <int>,
## # n <int>
Combo of lat, long, year, and month.
fueleconomy::vehicles
fueleconomy::vehicles %>%
count(id) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>
Just id!
ggplot2::diamonds
I don’t think there is one for this table!
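One way to check: if any full row of diamonds repeats, then no combination of its existing columns can uniquely identify an observation, so there is no natural primary key:

```r
library(dplyr)
library(ggplot2)

# Compare the total row count to the count of distinct rows;
# if distinct() returns fewer rows, exact duplicates exist and
# no set of existing columns can serve as a key
nrow(diamonds)
nrow(distinct(diamonds))
```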
(You might need to install some packages and read some documentation.)
Batting, Master, and Salaries tables in the Lahman package.
playerID is common to Batting, Master, and Salaries. To combine Batting and Salaries, you should use yearID as well.
Draw another diagram that shows the relationship between Master, Managers, AwardsManagers.
To join Managers and AwardsManagers, you can use playerID and yearID, probably filtering by plyrMgr == "Y". Then you can join in Master by playerID.
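A sketch of that join chain (the column selections are my assumptions about which Lahman fields you would want, not a prescribed set):

```r
library(dplyr)
library(Lahman)

# Manager seasons joined to manager awards, then names pulled in
# from Master via playerID
Managers %>%
  filter(plyrMgr == "Y") %>%
  inner_join(AwardsManagers, by = c("playerID", "yearID", "lgID")) %>%
  left_join(select(Master, playerID, nameFirst, nameLast),
            by = "playerID")
```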
How would you characterise the relationship between the Batting, Pitching, and Fielding tables?
They have the same primary key – a combo of playerID, yearID, and stint.